working with data in a scientific way that will produce new and reproducible insight
This course is an introduction to the
DS is being able to push through a lot of the difficulties you face when dealing with large or messy data. It includes
What is the key challenge in DS?
You are interested in answering questions with data and are in a situation where
In summary, you can work on problems where
From The Economist, The data deluge.
Over the last several years
From McKinsey Global Institute, Big data: The next frontier.
The other term that comes into play now is big data, which is a sort of new frontier: we have data in areas where we didn't use to have it. For example, now
This DS track will have a statistical bent.
Machine learning and data analytics can both be synthesized under the heading of statistical learning. Why?
Statistics is the science of learning from data.
It's very rare that you'll get a data set where all of the answers are really clear and there's no uncertainty.
In any case where there is uncertainty, that's where statistics comes in and plays a role.
From Drew Conway.
Raw data
Processed data
Examples:
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?
Others: Reinforcement learning, recommender systems.
See slides
See slides
See slides
From Andrew Ng.
From Andrew Ng.
From Andrew Ng.
You’re running a company, and you want to develop learning algorithms to address each of two problems.
Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
Problem 2: You’d like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised.
Should you treat these as classification or as regression problems?
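The usual answer is that Problem 1 is regression (the output, units sold, is a continuous quantity) while Problem 2 is classification (the output, hacked or not, is a discrete label). A minimal Python sketch of that distinction, using made-up numbers and a hypothetical rule-of-thumb classifier (both are assumptions for illustration, not course material):

```python
# Made-up numbers for illustration only.

# Problem 1 -- regression: the target is a continuous quantity (units sold).
monthly_sales = [120, 135, 150]                      # past three months
trend = monthly_sales[-1] - monthly_sales[0]         # crude linear trend
forecast_next_3_months = sum(monthly_sales) + trend  # a number, not a label

# Problem 2 -- classification: the target is a discrete label (hacked or not).
def looks_hacked(logins_last_hour, login_from_new_country):
    # Hypothetical rule-of-thumb classifier; a real system would learn this rule.
    return logins_last_hour > 50 or login_from_new_country

print(forecast_next_3_months)      # continuous output -> regression
print(looks_hacked(80, False))     # True/False label -> classification
```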
Of the following examples, which would you address using an unsupervised learning algorithm? (Check all that apply.)
Suppose we observe training data \((\mathbf X_i, Y_i)\) for \(i = 1, \ldots, n\).
Objectives
On the basis of the training data we would like to:
We can model the relationship as \[ Y_i=f({\mathbf X}_i) + \epsilon_i \] where \(f\) is an unknown function and \(\epsilon_i\) is a random error with mean \(0\).
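As an illustration (not from the slides), this model can be simulated in Python; here the unknown \(f\) is assumed linear, and its coefficients are recovered by ordinary least squares:

```python
import random

random.seed(0)

# Hypothetical true relationship (an assumption for illustration): f(x) = 2 + 3x.
def f(x):
    return 2.0 + 3.0 * x

# Simulate training data Y_i = f(X_i) + eps_i with mean-zero Gaussian error.
n = 200
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [f(x) + random.gauss(0, 1) for x in xs]

# Estimate f by ordinary least squares (closed form for a single predictor).
xbar = sum(xs) / n
ybar = sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

print(intercept, slope)  # should land near the true values 2 and 3
```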
A simple example
From Al Sharif.
Estimating \(f\) gets more difficult as uncertainty (error size) increases.
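A quick simulation makes this point concrete: the same least-squares fit recovers the slope of a (hypothetical, assumed-linear) \(f\) much less accurately when the error standard deviation grows:

```python
import random

def fit_slope(sigma, n=100, seed=1):
    """Simulate Y = 2 + 3X + eps with error sd `sigma`; return the OLS slope."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [2 + 3 * x + rng.gauss(0, sigma) for x in xs]
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx

def avg_error(sigma, reps=200):
    """Average absolute slope error over many simulated training sets."""
    return sum(abs(fit_slope(sigma, seed=s) - 3) for s in range(reps)) / reps

print(avg_error(0.1), avg_error(2.0))  # larger error sd -> worse estimate of f
```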
Another example
From Al Sharif.
Statistical learning refers to using the data to "learn" \(f\).
Why do we care about estimating \(f\)?
For example,
Some ideas
A common example is market segmentation, where we try to divide potential customers into groups based on their characteristics.
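A sketch of what such a segmentation might look like, using a plain k-means clustering pass over simulated customer data (the ages and spending figures are assumptions for illustration, not real data):

```python
import random

# Hypothetical customer features (age, annual spend in $1000s), simulated
# as two loose groups: younger low spenders and older high spenders.
random.seed(2)
customers = ([(random.gauss(25, 3), random.gauss(5, 1)) for _ in range(30)] +
             [(random.gauss(55, 4), random.gauss(20, 3)) for _ in range(30)])

def kmeans(points, k, iters=20):
    """Plain k-means: alternately assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    centers = points[::max(1, len(points) // k)][:k]  # simple deterministic start
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                  (p[1] - centers[i][1]) ** 2)
            groups[j].append(p)
        centers = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                   if g else centers[i] for i, g in enumerate(groups)]
    return centers

segments = sorted(kmeans(customers, 2))
print(segments)  # one younger/low-spend segment, one older/high-spend segment
```

Note that no output labels were used: the groups emerge from the inputs alone, which is what makes this unsupervised.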
Supervised (statistical) learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.
In unsupervised learning there are inputs (but no supervising output) from which we can learn structure.